CoCoNat: a novel method based on deep learning for coiled-coil prediction

Abstract Motivation Coiled-coil domains (CCD) are widespread in all organisms and perform several crucial functions. Given their relevance, the computational detection of CCD is very important for protein functional annotation. State-of-the-art prediction methods include the precise identification of CCD boundaries, the annotation of the typical heptad repeat pattern along the coiled-coil helices as well as the prediction of the oligomerization state. Results In this article, we describe CoCoNat, a novel method for predicting coiled-coil helix boundaries, residue-level register annotation, and oligomerization state. Our method encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field for CCD identification and refinement. A final neural network predicts the oligomerization state. When tested on a blind test set routinely adopted, CoCoNat obtains a performance superior to the current state-of-the-art both for residue-level and segment-level CCD. CoCoNat significantly outperforms the most recent state-of-the-art methods on register annotation and prediction of oligomerization states. Availability and implementation CoCoNat web server is available at https://coconat.biocomp.unibo.it. Standalone version is available on GitHub at https://github.com/BolognaBiocomp/coconat.


Introduction
Coiled-coil domains (CCD) in proteins are structural motifs where a-helices pack together in an arrangement called knobs into holes (Crick 1952(Crick , 1953a. Since the first crystallographic observation in the structure of influenza virus hemagglutinin (Wilson et al. 1981), CCDs have been resolved in several proteins through all the kingdoms of life (Truebestein and Leonard 2016). CCDs are present, among others, in structural proteins, transcription factors, and enzymes (Truebestein andLeonard 2016, Lupas andBassler 2017). CCDs act as molecular spacers, influence the organelle organization, constrain the distance of residues involved in binding and catalytic sites, mediate membrane fusion, and are involved in signal transduction and solute transport.
Canonical CCDs include the interaction of two or more a-helices, wound around each other and forming a supercoiled bundle. Each helix is characterized by the repetition of a seven-residue motif (heptad repeat) whose positions are referred to as registers and are labeled as abcdefg. Positions a and d are routinely occupied by hydrophobic residues and mediate the interaction between different helices in the domain. As a result of the formation of the supercoil bundle, the effective periodicity of a-helices in CCDs changes from 3.6 to 3.5 residues per turn , Lupas et al. 2017, Szczepaniak et al. 2020. This implies that residues in the same register lie on the same side of the helix surface. Therefore, the hydrophobic nature of residues in registers a and d confers a peculiar amphipathic character to the a-helix. In CCDs, a-helices interact with each other through their hydrophobic face . CCDs can contain helices characterized by non-canonical repeats, longer than seven residues, i.e. hendecades, pentadecades, and nonadecades Gruber 2005, Szczepaniak et al. 2020). This article does not take into consideration these cases.
CCDs are classified according to the number and orientation of the involved a-helices, i.e. by their oligomerization state. CCDs based on the orientation of helices are classified as parallel or antiparallel and based on the number of helices as dimers, trimers, and tetramers.
Recently protein language models improved sequence encoding procedures. Here we introduce CoCoNat which, for the first time, adopts a sequence encoding based on the combination of two state-of-the-art protein language models, ProtT5 (Elnaggar et al. 2021) and ESM2 (Lin et al. 2023) and based on deep-learning computes: (i) the coiled-coil helix boundaries; (ii) the residue-level register annotation, and (iii) the CCD oligomerization state.
We trained CoCoNat on a dataset comprising 2198 proteins containing CCDs and 9062 proteins without CCD (negative examples). When tested on a blind test set including 400 CCD and 318 non-CCD proteins, CoCoNat scores with a performance that is superior to the current state-of-the-art both for residue-level and segment-level CCD detection. Moreover, CoCoNat significantly outperforms other methods, on both register annotation and prediction of CCD oligomerization state.

Datasets
CoCoNat is trained and tested on the same datasets adopted in CoCoPRED (Feng et al. 2022). Numbers are summarized in Table 1 and all datasets are available at the CoCoNat website (https://coconat.biocomp.unibo.it). Additional statistics on the oligomeric state classification of coiled-coil helices in the training and testing sets are reported in Supplementary  Fig. 1.

Training dataset
The positive training dataset contains 2191 proteins out of the 2337 included in CoCoPRED and deriving from 30,227 CCD-containing proteins annotated with SOCKET (Walshaw and Woolfson 2001) in the CCþ database (Testa et al. 2009).
Briefly, proteins included in the positive training set of CoCoPRED have been selected with the following criteria: (i) protein structure resolution < 4 Å , (ii) protein length between 25 and 700 residues, (iii) length of CCD helices ! 8 residues; (iv) absence of non-canonical (i.e. non-heptad based) CCDs; (v) CCD oligomeric state classified as parallel or antiparallel dimer, trimer, and tetramer; (vi) sequence pairwise identity lower than 30% with respect to proteins in the testing set (see below); and (vii) internal pairwise sequence identity lower than 50%.
Since CoCoNat relies on full length proteins for the computation of sequence embeddings, we applied further filters to the CoCoPRED positive training dataset, removing: (i) proteins not mapped into UniProt; (ii) synthetic and fusion proteins; and (iii) proteins whose structure coverage with respect to the UniProt sequence is lower than 70%. After this screening, the positive dataset includes 2191 proteins with 4342 coiled-coil helices, whose length ranges from 8 to 145 residues. The number of coiled-coil helices per protein ranges from 1 to 19.
The negative training set of CoCoNat includes 9040 proteins. This derives from the 9358 proteins of the negative set of CoCoPRED that was obtained from the negative set of DeepCoil (Ludwiczak et al. 2019) after the exclusion of proteins with a sequence identity > 30% with respect to the blind test set (see Section 2.1.2) and with a sequence identity > 50% with respect to the positive training set. We filtered out proteins not mapped into UniProt.
Proteins in the training set (both positive and negative examples) were split into 10 subsets for 10-fold crossvalidation. To reduce the redundancy among cross-validation sets, proteins sharing more than 25% sequence identity at 50% coverage are clustered in the same set. Cross-validation sets were used to set all the hyperparameters.

Blind test dataset
To test and benchmark CoCoNat with other available methods, we adopted the 718 proteins included in the CoCoPRED test set. The CoCoPRED set shares less than 30% sequence identity with proteins in the training sets of CCHMM_PROF (Bartoli et al. 2009), MARCOIL (Delorenzi and Speed 2002), Multicoil2 (Trigg et al. 2011), CoCoPRED (Feng et al. 2022), and DeepCoil2 (Ludwiczak et al. 2019).
Since coiled-coil annotation for this dataset was not available from the CoCoPRED website, we ran SOCKET in house on the structure of the main biological assembly of each PDB included in the dataset. By this, 400 proteins are annotated as containing CCD while 318 do not contain any coiled-coil segment. Overall, the 400 positive proteins contain 863 coiledcoil helices (Table 1).

Protein encoding
CoCoNat makes use of residue embeddings obtained with large-scale protein language models (pLMs) to represent proteins in training and testing sets. Specifically, we adopted two state-of-the-art pLMs: ProtT5 (Elnaggar et al. 2021) and ESM2 (Lin et al. 2023), generating, for each residue in the protein sequence, 1024 and 1280 features, respectively. The ESM2 model has been released in several versions, based on different transformer architectures with a varying number of cascading layers. In order to limit the resources required to compute the representations and having an embedding dimension comparable to that provided by ProtT5, we adopted the intermediate model, providing representations of 1280 components and comprising 33 transformer layers with 650M of parameters.
Residue-level representations are then concatenated together, leading to vectors of 2304 dimensions for each residue in the sequences. The concatenation of embeddings obtained with different pLMs has been shown to improve the performance in previous works (Manfredi et al. 2022(Manfredi et al. , 2023).

CoCoNat architecture
CoCoNat is organized as a three-step method combining a deep learning approach, a probabilistic graphical model, and a single-layer neural network (NN) in a cascading way. The three different steps of the model are trained independently from each other. The first two steps are collectively devised for detecting coiled-coil helix boundaries and the residue-level annotation of registers within each predicted helix. The third step predicts the CCD oligomerization state (Fig. 1). In the following sections, each step is briefly described.

Deep learning architecture
The first step ( Fig. 1) is based on a convolutional layer (LeCun et al. 1989) followed by a long short-term memory (LSTM) layer (Hochreiter and Schmidhuber 1997). The convolutional layer captures local dependencies of the input data. This layer, adopting a 15 residue long sliding window, takes as input the protein, where each residue is represented with a 2304-feature vector. By applying 40 different filters, the layer outputs the same protein with residues encoded with 40-feature vectors. This mapping is provided as input to a LSTM layer including 128 output neurons, which captures long-range dependencies. Finally, we apply a standard feedforward network (with 64 hidden neurons) with 8 output neurons (one for each possible coiled-coil registers, a-b-c-de-f-g, plus one, i, for non-CCD residues), endowed with a sigmoid activation function. The output gives the per-residue probability of each register or none. In order to reduce overfitting, we introduce dropout layers between convolutional, LSTM, and the feed-forward network with rate fixed to 0.25. The architecture was trained with a gradient descent of the Kullback-Leibler divergence error function with the Adam optimization algorithm (Kingma and Ba 2017). The Kullback-Leibler loss was chosen since it wellfits with models providing probability distributions in output, as in our case. The best model was determined using the early stopping technique of 10 epochs in which the validation error did not decrease.

Refining the prediction with Grammatical-Restrained
Hidden Conditional Random Field (second step) The second step (Fig. 1) takes in input the probabilites computed by step 1. Grammatical-Restrained Hidden Conditional Figure 1. Workflow of the predictive model of CoCoNat, comprising three steps. The step 1 maps each residue, encoded with a 2304-dimensional vector (the concatenation of ProtT5 and ESM2 embeddings) into an eight-dimensional vector representing the probability distribution over the labels (a-g registers for coiled-coil helices and i for non-CCD portions). It combines in cascade (i) a convolutional layer that computes a 40-dimensional representation for each position in the sequence, based on a sliding window of 15 contiguous residues, (ii) an LSTM layer that analyzes the whole sequence and provides a 128-dimensional representation for each position, and (iii) a fully connected feed-forward NN that, residue by residue, provides the 8dimensional mapping. The step 2 consists of a GRHCRF that casts the grammar of CCDs in the topology of the connections among 20 different states. Each sequence is generated by a path that starts from the BEGIN state and can either enter the self-looping i0 state, which models the N-terminal non-CCD portion of the protein, or the first eight-state block (1a-1b-1c-1d-1e-1f-1g-1H), which models the first CCD domain. Labels a-g of the states in the block correspond to registers and state H accommodates non-regular transitions. Residues after the first coiled-coil helix are modeled by the i1 state (non-CCD) and, in case, by a second CCD block (2a-2b-2c-2d-2e-2f-2g-2H), analogous to the first one. All states, but 1H and 2H, can make transition to the END state, terminating the path. GRHCRF provides the annotation of coiled-coil helix boundaries and of registers by computing the optimal a posteriori Viterbi path, given the probabilities computed in the step 1. The step 3 provides the prediction of the oligomeric state, based on the annotation computed in step 2 and on the 2304-dimensional embedding. For each predicted coiled-coil helix, embeddings labeled with registers a and d are separately averaged and fed into a feed-forward NN that computes the probability distribution over the four possible classes (parallel and antiparallel dimer, trimer, and tetramer). The three steps of the architecture are trained separately.
Random Field (GRHCRF) is a discriminative probabilistic model , Madeo et al. 2021, and it allows to introduce the regular grammar of the CDD registers. In the prediction phase, a posterior-Viterbi dynamic-programming algorithm computes the most probable path along the model, satisfying the grammatical constraints.
In our model, the grammar has two identical blocks for CCD prediction, two states (i0 and i1 in Fig. 1, step 2) with a self-loop modeling the non-CCD regions, one BEGIN and one END states. In each CCD block, seven states model the register sequence, and one further state (H, 1, and 2) accommodates non-regular transitions (i.e. transitions escaping the regular heptad repeat pattern, abcdefg). The first GRHCRF block models the first coiled-coil helix and the second one models all the others coiled-coil helices, when present, in the protein sequence.
The GRHCRF output defines the precise identification of CCD boundaries and annotates the typical heptad repeat pattern along the coiled-coil helices.

Prediction of the oligomerization state
The prediction of the CCD oligomerization state adopts a simple feed-forward NN with a single hidden layer, comprising 128 neurons, and four output units corresponding to the four possible oligomerization states: parallel and antiparallel dimers, trimers, and tetramers.
The input of this network is built based on a well-known biophysical feature: the oligomeric state of canonical CCD is largely determined by the nature of hydrophobic residues in the heptad repeat pattern, namely, residues labeled with registers a and d (Woolfson 2023;Li et al. 2016). Based on this observation, for a given coiled-coil helix, the network input is obtained concatenating the average embedding vectors of a and d predicted positions.
More formally, given a coiled-coil helix of length l, with an embedding matrix E of dimension l Â 2304 (as derived from the concatenation of two pLM embeddings, ProtT5 and ESM2), the following two mean vectors are computed: where N a and N d are the number of positions labeled with registers a and d, respectively. The input vector for the network is then obtained concatenating e a and e d : x ¼ e a Á e d where Á ½ denotes the vector concatenation operator.

Model training and selection procedure
The whole CoCoNat architecture was trained with three independent training procedures for steps 1, 2, and 3. The deep-learning architecture of step 1 was optimized using 10-fold cross-validation and a grid search to select the main model hyperparameters. Specifically, we selected the best number of convolutional filters (testing values in the set 10, 20, 40, 80, 160), the optimal LSTM hidden output size (testing values in the set 32, 64, 128, 256), and the optimal number of hidden neurons in the final feed-forward network (testing values in the set 12, 32, 64, 128, 256). The optimal configuration was chosen as the one maximizing the crossvalidation F1 score at residue level (see next section), and include 40 convolutional filters, 128 units for the LSTM size and 64 neurons in the hidden layer of the final feed-forward network.
For step 2, the hyperparameters of the GRHCRF include the r 2 used for L2 regularization and the number of training iterations ). These were optimized in cross-validation and grid search testing various combinations (r 2 in the set 0.0001, 0.001, 0.05, 0.1 and iterations in the set 10, 20, 40, 100). The optimal values are r 2 ¼ 0.05 and 40 iterations.
Finally, step 3 architecture only required to optimize the number of hidden neurons. Again, we tested different values (16,32,64,128,256) and selected the optimal one as those maximizing the average MCC across the four oligomeric states. The best value is 128.

Scoring performance
To evaluate the performance of our method in recognizing coiled-coil helices, we adopted residue-and segment-based measures.
where TP R , FP R , and FN R are true positive, false positive, and false negative coiled-coiled residues, respectively. Analogously, the segment-based scores include precision (PRE S ), recall (REC S ), and F1-score (F1 S ): In this case, TP S , FP S, and FN S are computed for the coiledcoil helices. Prediction is considered correct (TP S ) only if the overlap between predicted and observed segments is at least equal to the half-length of the longest segment.
We computed the precision-recall (PR) curve and the relative area under the curve (PR-AUC), by plotting the two measures at varying thresholds of coiled-coil probabilities as obtained from the GRHCRF posterior probability values.
For the register and oligomeric state prediction tasks, we reported class-level MCC values: 3 Results

Visualizing ProtT5 and ESM2 embeddings with t-SNE
For evaluating whether the two language models capture coiled-coil-related features in their representations, we adopted t-SNE (van der Maaten and Hinton 2008) to project raw embedding vectors obtained with ProtT5 and ESM2 in two dimensions. First, we wanted to assess if the raw embeddings can distinguish the different register features. To this aim, we projected all representations of coiled-coil residues in the training set, and then we colored projected points by the heptad repeat register they are annotated with, distinguishing two classes: hydrophobic register positions (a and d) and polar positions (b, c, e, f, and g). Results are shown in Supplementary Fig.  2A. From the projections, even if a clear separation is missing, it is evident that the two models can capture the hydrophobic/ polar register difference to a good extent. Among the two models, ProtT5 seems to provide a slightly better separation.
Second, to investigate if the embedding can capture the relation between hydrophobic registers and oligomeric state, we projected all representations of hydrophobic register positions in the training set (a and d), and then we colored projected points according to the oligomeric state of the corresponding helix. Resulting t-SNE projections are shown in Supplementary Fig. 2B. In this case, the separation is less evident, even if some clusters are visible (mostly in ProtT5 projections) for parallel dimers and trimers. Possibly, more t-SNE dimensions are needed to achieve a better separation.
Overall, these experiments suggest that information is already present in the raw embeddings. However, a specific transfer learning architecture, processing the whole sequence, and further exploiting the local and global sequence context is needed to better capture coiled-coil features.

Cross-validation results
We performed 10-fold cross-validation experiments to evaluate the contribution from the two protein language models. We independently trained three different identical models adopting as input: (i) ProtT5, (ii) ESM2, and (iii) both encodings combined in a single vector. Results are reported in Table 2. The overall prediction achieved with each one of the two language models is similar (F1 R and F1 S scores equals 0.44, 0.34, and 0.46, 0.37 with ProtT5 and ESM2, respectively). However, combining the two embeddings into a single vector leads to better performances, raising both precision and recall values, and achieving F1 R and F1 S values of 0.52 and 0.41, respectively. This suggests that the two models are somewhat complementary. We analyzed and compared residue-level true positive predictions obtained with inputs based on the two pLMs. We find that the two models individually trained with ProtT5 and ESM2 share 25 373 correctly predicted residues, and that 4166 and 8341 coiled-coil residues are correctly and uniquely identified, respectively. This finding highlights the complementarity of the two models. These results agree with previous works in which the combination of embeddings from different pLMs has been proven effective also for other prediction tasks (Manfredi et al. 2022(Manfredi et al. , 2023. All results presented in this manuscript are obtained using the combination of the two input embeddings.

Prediction of coiled coils on the blind test set
CoCoNat is benchmarked on the same blind test set against available methods. In Table 3, we report results for the prediction of coiled-coil helices. Tested methods include: MARCOIL (Delorenzi and Speed 2002), CCHMM_prof (Bartoli et al. 2009), Multicoil2 (Trigg et al. 2011), DeepCoil2 (Ludwiczak et al. 2019, and CoCoPRED (Feng et al. 2022). All the methods were run in house using the respective available standalone versions. Since DeepCoil2 does not provide a classification but rather a probability value, we report results obtained by applying two different probability thresholds set to 0.2 and 0.5, respectively.
CoCoNat, which adopts encodings based on ProT5 and ESM2, outperforms the state-of-the-art, showing (with respect to the second top performing method in the benchmark) an improvement in the per-residue precision value (0.55), with a slight loss in recall (0.53), which is reflected in the higher value of the F1-score (0.54). The per-segment scores of CoCoNat confirm this trend. Moreover, CoCoNat performs with the highest SOVp and the third highest SOVo (see Section 2.4 for definition).
For sake of assessing the significance of the differences observed in data reported in Table 3, we performed a boostrapping procedure and a two-sample Welch's t-test. Specifically, from the blind test set results, we randomly selected 100 samples of 300 sequences and evaluated all the methods using residue-and segment-level scoring measures. Then, average performances of CoCoNat were compared for statistical significance with average scores of other tools. All the differences observed in this experiment reflect those reported in Table 3 and are all significant at 0.0001 significance threshold. Results are reported in Supplementary Table 1.

Prediction of coiled-coil registers
We compared CoCoNat with other tools in the task of annotating heptad repeat registers. To this aim, we used the blind test of 400 proteins endowed with CCDs. Results of all  Fig. 1. The training set is described in Table 1. Results are obtained adopting a 10-fold cross-validation. For details, see Section 2. Subscript R: per residue; subscript S: per CC helix segment. Variability across cross-validation sets is lower than 1% for all scores. methods were generated using the respective standalone versions (Table 4).
For all the register labels a-g, CoCoNat MCC values indicate an improvement ranging from 16% to 19% with respect to the second best-performing method, CoCoPRED.
Remarkably, CoCoNat MCC values are quite similar across all the seven different register labels, suggesting that the register is routinely predicted in the correct regular configuration, from label a to g. This highlights that the GRHCRF grammar is properly capturing transition constraints among the different registers within the coiled-coil segments.

Prediction of the CCD oligomerization state
We finally compared CoCoNat, LOGICOIL (Vincent et al. 2013) CoCoPRED (Feng et al. 2022) on the task of predicting CCD oligomerization state. Again, we used the blind test set of 400 proteins as benchmark. The three methods are compared assuming an oracle predictor for the identification of CCD segments (i.e. we classify real CCD segments into the four oligomerization classes). In Table 5, we report a comparison of the three approaches in terms of per-class MCCs.
Looking at the MCC values, CoCoNat significantly overpasses CoCoPRED and LOGICOIL, providing predictions that are overall higher and more balanced across the four oligomerization state classes. Remarkably, CoCoNat outperforms other tools also on less abundant classes, i.e. trimers and tetramers.

CoCoNat availability and analysis of running time
We release CoCoNat as both web server and standalone version. The web server is available at https://coconat.biocomp. unibo.it. The server provides a user-friendly web interface, allowing the user to choose between two modes of use of the tool: (i) analysis and visualization of coiled-coil prediction results for a single input sequence; (ii) submission of a batch job allowing prediction of coiled-coils and download of results (in TSV and JSON formats) for up to 500 sequences per job. Additionally, for single-sequence mode, we also provide the possibility of uploading a pre-determined set of coiled-coil segments, restricting the prediction to the oligomeric state only. The web application is implemented using Django (version 4.0.4), Bootstrap (version 5.3.0), JQuery (version 3.6.0), and neXtProt FeatureViewer (version 1.3.0-beta6) for visualization of predicted coiled-coil segments along the sequence.
The standalone version of CoCoNat is available on GitHub at https://github.com/BolognaBiocomp/coconat. The standalone tool is implemented in Python as a Docker containerized application. This avoids the installation of dependencies and allows users to quickly install the program in any server equipped with Docker. Instructions on how to build the Docker image and run CoCoNat are available on the GitHub repository.
We performed experiments to evaluate the running time of CoCoNat in different conditions. All the experiments were performed on the virtual machine hosting the web server,    Madeo et al.
equipped with AMD EPYC 7301 12-Core Processor, 48G RAM. No GPU is available on this machine. First, we analyzed the impact of the protein sequence length on the running time. To this aim, we randomly selected different sets of 100 sequences with lengths of increasing size. Fifty samples are generated for each length bin. CoCoNat has been then executed on each sample, evaluating its running time. Results are shown in Supplementary Fig. 3A. The running time scales linearly with the length of the sequences, ranging from 200 s (for 100 sequences) when protein length is 50-100 residues to 1400 s when length is 600/700 residues.
Second, we analyzed how the number of sequences in the dataset impacts on the running time. Again, random samples of sequences were generated, varying from 10, 20, 40, 80, 160, 320, and 500. The length of sequences was set between 100 and 200 residues for all samples. Results are reported in Supplementary Fig. 3B. The time required for the datasets including 10, 20, and 40 sequences is almost identical. This is due to the overhead required to load the two pLMs for encoding, which dominates the overall running time when the number of sequences is low. For dataset sizes including more than 40 sequences, the running time scales almost linearly from 100 to 1400 s.
In general, the CoCoNat running time is always low if compared to the time required by other tools based on multiplesequence alignment inputs, such as CoCoPRED. For instance, to predict coiled-coil helices, including registers and oligomeric state, on 100 sequences of length comprised between 100 and 200 residues, the CoCoNat average running time is 330 s (5.5 min), which is lower than the time required by CoCoPRED, requiring about 2.5 h.

Conclusion
In this article, we described CoCoNat, a novel method based on protein language model embeddings and deep learning for detection of coiled-coiled helices at residue level, prediction of coiled-coil heptad repeat registers, and oligomerization state.
Training and testing were performed on datasets derived from literature. When compared with other state-of-the-art tools, CoCoNat reports performance that are significantly better than those obtained by other approaches tested, in particular when considering register and oligomerization state prediction.
In this work, we also proved the relevance of adopting protein residue representations derived from large-scale protein language models such as ProtT5 (Elnaggar et al. 2021) and ESM2 (Lin et al. 2023) for this specific task. Moreover, we further confirmed that the combination of different language models provides better performance, suggesting that different models obtained with different architectures and data give complementary representations.

Supplementary data
Supplementary data are available at Bioinformatics online.